Periods, Capitalized Words, etc.

Author

  • Andrei Mikheev
Abstract

In this article we present an approach for tackling three important aspects of text normalization: sentence boundary disambiguation, disambiguation of capitalized words in positions where capitalization is expected, and identification of abbreviations. As opposed to the two dominant techniques of computing statistics or writing specialized grammars, our document-centered approach works by considering suggestive local contexts and repetitions of individual words within a document. This approach proved to be robust to domain shifts and new lexica and achieved performance on a par with the highest reported results. When incorporated into a part-of-speech tagger, it helped reduce the error rate significantly on capitalized words and sentence boundaries. We also investigated the portability to other languages and obtained encouraging results.
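The idea of using "repetitions of individual words within a document" can be illustrated with a minimal sketch. This is not the author's implementation: the function name `classify_ambiguous_capitals`, the tokenizer, and the three labels are assumptions made for illustration, and the sketch treats every period as a sentence boundary, which is exactly the simplification the article sets out to avoid. It shows only one plausible reading of the document-centered idea: a capitalized word at the start of a sentence is judged by how the same word appears elsewhere in the same document.

```python
import re

# Hypothetical tokenizer: alphabetic words and sentence-final punctuation only.
TOKEN_RE = re.compile(r"[A-Za-z]+|[.!?]")


def classify_ambiguous_capitals(text):
    """Label sentence-initial capitalized words using evidence from the
    same document (an illustrative heuristic, not the published method)."""
    tokens = TOKEN_RE.findall(text)
    seen_lower = set()    # words seen in lowercase anywhere in the document
    seen_cap_mid = set()  # words seen capitalized in mid-sentence position
    ambiguous = []        # capitalized words directly after a boundary

    at_start = True
    for tok in tokens:
        if tok in ".!?":
            # Simplifying assumption: every such mark ends a sentence.
            at_start = True
            continue
        key = tok.lower()
        if tok[0].isupper():
            if at_start:
                ambiguous.append(tok)
            else:
                seen_cap_mid.add(key)
        else:
            seen_lower.add(key)
        at_start = False

    verdicts = {}
    for tok in ambiguous:
        key = tok.lower()
        if key in seen_cap_mid and key not in seen_lower:
            verdicts[tok] = "proper"   # capitalized even where not required
        elif key in seen_lower and key not in seen_cap_mid:
            verdicts[tok] = "common"   # occurs in lowercase elsewhere
        else:
            verdicts[tok] = "unknown"  # no evidence, or conflicting evidence
    return verdicts
```

For the text "Brown arrived late. The meeting with Brown was short. Meeting notes were sent after the talk.", the sketch labels "Brown" as proper, because it recurs capitalized in mid-sentence, and labels "The" and "Meeting" as common, because both occur in lowercase elsewhere in the same text.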

Similar resources

A Knowledge-free Method for Capitalized Word Disambiguation

In this paper we present an approach to the disambiguation of capitalized words when they are used in positions where capitalization is expected, such as the first word in a sentence or after a period, quotes, etc. Such words can act as proper names or can be just capitalized variants of common words. The main feature of our approach is that it uses a minimum of prebuilt resources and trie...

Braille transcription

Transcribing romanized print into Braille suitable for reading by the blind is a problem which has similarities to those arising in mechanical translation. The theoretical problem of mechanical translation is to construct an operational syntax, a set of formal rules of translation prescribing operations to be performed on the text to get the output text entirely in terms of patterns of input wor...

Eye Movements Reveal Interplay Between Noun Capitalization and Word Class During Reading

Subjects' eye movements were recorded while they read sentences for comprehension. Sentences were presented with capitalized nouns—in agreement with German spelling rules—or completely in lowercase. Overall reading speed was not influenced by the manipulation of capitalization, but fixation durations were affected by the interplay between capitalization and the word classes of the fixated and t...

Separating Named Entities

In this paper, we analyze long sequences of mostly capitalized words that look like a single named entity but in fact consist of several named entities. An example of this phenomenon is hokejista (hockey player) New York Rangers Jaromír Jágr. Without splitting the sequence correctly, we will wrongly assume that the whole capitalized sequence is the name of the hockey player. To fin...

Lishe Tanzania Food and Nutrition Journal

The abstract should include key words, arranged alphabetically with only the first letter of the key word capitalized.

Journal title:
  • Computational Linguistics

Volume 28

Publication date: 2002